The dataset I am going to explore is provided by Dirk Rossmann GmbH, Germany’s second largest drugstore chain, as part of a regression competition on Kaggle.com.
The dataset contains daily entries for three years of over one thousand of their stores detailing the number of customers served, and total sales amount as well as information about the day (such as whether or not it was a holiday ). Useful information about the stores (such as the distance to the nearest competitor and store type) is also included in a separate file.
## 'data.frame': 1115 obs. of 10 variables:
## $ Store : int 1 2 3 4 5 6 7 8 9 10 ...
## $ StoreType : Factor w/ 4 levels "a","b","c","d": 3 1 1 3 1 1 1 1 1 1 ...
## $ Assortment : Factor w/ 3 levels "a","b","c": 1 1 1 3 1 1 3 1 3 1 ...
## $ CompetitionDistance : int 1270 570 14130 620 29910 310 24000 7520 2030 3160 ...
## $ CompetitionOpenSinceMonth: int 9 11 12 9 4 12 4 10 8 9 ...
## $ CompetitionOpenSinceYear : int 2008 2007 2006 2009 2015 2013 2013 2014 2000 2009 ...
## $ Promo2 : int 0 1 1 0 0 0 0 0 0 0 ...
## $ Promo2SinceWeek : int NA 13 14 NA NA NA NA NA NA NA ...
## $ Promo2SinceYear : int NA 2010 2011 NA NA NA NA NA NA NA ...
## $ PromoInterval : Factor w/ 4 levels "","Feb,May,Aug,Nov",..: 1 3 3 1 1 1 1 1 1 1 ...
This dataframe contains the store-specific information. Store, StoreType, Assortment, and Promo2 and PromoInterval are nominal variables.
These are fairly self-explanatory, except Promo2 which denotes the presence of regular monthly promotions (aside from day-specific promotions in the sales dataframe). PromoInterval specifies the specific combination of months in which this promotion is active. There are only 4 different combinations. Assortment and StoreType characterize each store but their meanings are not specified to us by the data provider.
The combination of Promo2SinceWeek, and Promo2SinceYear is an ordered variable describing the week in which Promo2 became active in the store. The same can be said of the combination of CompetitionOpenSinceMonth and CompetitionOpenSinceYear which together make the ordered variable for the month in which the closest competitor became active.
CompetitionDistance is the only continuous variable in this dataframe, and this denotes the distance to the closest competitor. It is specified in meters.
## 'data.frame': 1017209 obs. of 9 variables:
## $ Store : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DayOfWeek : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Date : Factor w/ 942 levels "2013-01-01","2013-01-02",..: 942 942 942 942 942 942 942 942 942 942 ...
## $ Sales : int 5263 6064 8314 13995 4822 5651 15344 8492 8565 7185 ...
## $ Customers : int 555 625 821 1498 559 589 1414 833 687 681 ...
## $ Open : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ StateHoliday : Factor w/ 4 levels "0","a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
## $ SchoolHoliday: int 1 1 1 1 1 1 1 1 1 1 ...
This dataframe contains the daily data for each of the stores. The Store feature contains the id which allows us to match from sales data to store-specific data.
Here, Promo, Open, StateHoliday, and SchoolHoliday are nominal variables. Promo is binary and denotes the presence of promotions that day. Open is also binary and indicates whether the store was open. StateHoliday has 4 values which the dataset provider defines in the readme as: a = public holiday, b = Easter holiday, c = Christmas, 0 = None.
DayOfWeek and Date are ordered variables but here Date is taken as nominal. We should fix this:
Sales and Customers are the quantitative variables of interest for this project. They are both integer valued in this case but Customers is obviously discrete.
My guiding questions for the exploration will be:
There do seem to be some cyclical patterns but it is hard to to see all at once. Let’s look at a couple of months.
In this plot, the breaks on the x-axis are weekly and it is clear that the peaks occur on Mondays. Interestingly, there seem to be two levels of peaks, occuring on an alternating basis on Mondays. Why are there two different peaks?
What if we look at the number of total promotions per day for the same period.
Ah! This is interesting. The large peaks correspond to Mondays where the promotions were active, the low peaks to Mondays without promotions.
Let’s now take a look at the total sales grouped by day of week to see any persisting weekly trends.
Day 7 of course is Sunday and so that low value probably corresponds to closed stores. Let’s confirm that.
The y-axis numbers show the total number of stores open on that day of the week over the whole 3 year period. We saw in the tables above, however that not all the weekdays are represented equally in the data and more importantly, stores arent always open. So lets take a look at the mean sales per day of week considering only open stores.
Well this is interesting. Sundays have almost the same mean as Mondays. It’s just that less stores are open on Sundays. Saturday actually has the lowest mean sales.
It seems quite a sizeable increase for most of the stores. How much more of a relative increase is it? Let’s make a histogram of the relative increases.
Lets look at some summary statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9383 1.6900 1.8400 1.8530 1.9990 3.2580
The mean is 1.85 which is to say that, on average, total sales on promotion days are almost twice that of non-promotion days. There is seemingly a single store whose average with promotions is less! The data seems fairly normally distributed, but we can look at a QQ-plot to confirm.
It’s not quite normal due to fatter tails on both sides. Perhaps other variables can shed some light.
For example, does the presence of regular monthly promotions (Promo2) have any effect on the efficacy of daily promotions (Promo)?
##
## 0 1
## 544 571
Not too much from the looks of it. Does the store type have any bearing in this?
It seems so! Specifically, stores of type B seem to be hardly affected by promotions in comparison to other types.
What about the assortment?
Interestingly it seems as though assortments type b has a similar distribution to stores of type b. Lets look at them:
## Assortment StoreType
## 85 a b
## 259 b b
## 262 a b
## 274 b b
## 335 a b
## 353 b b
## 423 a b
## 494 a b
## 512 b b
## 562 c b
## 676 b b
## 682 a b
## 733 b b
## 769 b b
## 948 b b
## 1081 a b
## 1097 b b
Indeed, it seems that stores with ‘b’ class Assortments all have a StoreType of ‘b’, but not the other way around. These designations are not explained by the data source, but such correlation is useful to know about.
Some differences there. Specifically, it seems again that type ‘b’ is indicative of high throughput. But this may not be very helpful. Particularly because many of the combinations of store type and assortment have such little representation.
##
## a b c
## 593 9 513
##
## a b c d
## 602 17 148 348
But how are daily sales per storetype distributed? Let’s look at some density curves.
The distribution is fairly long-tailed. Let’s look at this in log scale.
Again, storetype b is quite distinctive with more probability density for higher sales. Storetypes a and c have very similar distributions, wheras d is slightly different. Let’s do the same for the total number of customers served.
Customers and Sales should be about linearly correlated.
As expected, the relationship is linear but there is an interesting clustering by storetype. For example, its clear here that storetype ‘d’ is indicative of higher sales per customer. Let’s look at the distributions of sales per customer for each storetype.
While sales and customer throughput are higher for storetype ‘b’, the average sales per customer is actually the lowest! And now we can see that storetype ‘d’ is actually making the most per customer by what seems like a significant margin.
I’ll first look at a simple scatterplot.
The data is fairly long-tailed in both axes. Also, it’s hard to see any trends because each store has a fixed competition distance but varying daily cusotmers, so we have many veritcal lines. Let’s look at the total sales per store instead, and use a log-log scale.
Lets calculate the correlation coefficient (ignoring missing values):
## [1] -0.2935466
What if we use the mean daily sales on the y-axis instead of total per store? That way we can ignore differences in how often stores remain open for business.
Now the correlation coefficient is:
## [1] -0.3168738
There is a somewhat clearer inverse linear trend there. I think competition distance may have more of an effect on public holidays? Let’s see.
And quantitatively the correlation coefficient is:
## [1] -0.3334342
Once again, slighlty more pronounced but quite possibly a spurious relationship. Hypothesis testing would be required to really confirm the significance of the difference.
As a sanity check, let’s see how many stores are still open per holiday type.
Okay, this is as expected. On regular days (holiday type ‘0’) almost all of the stores are open and a very small fraction stay open on holidays. What store types make up the open stores for each holiday type?
This is interesting, especially because stores of type ‘b’ are relatively few. Let’s look at the proportion of each type that stays open instead.
Oh, it seems we’ve uncovered something about stores of type ‘b’. Almost all of them stay open on holidays!
Finally, on average, how much does each store type make per day on each holiday type.
First, I wanted to explore that aggregate cyclical trend we saw on sales per day. And I wanted to clearly show the effect of the promotions on the peaks of that trend.
This plot shows a very clear weekly cycle. When there is a promotion (shown by blue shading), sales peak on Mondays and dip throughout the week, experiencing a slight rise again on Fridays. In weeks wihtout promotions, sales are relatively steady throughout the week, with very slight peaks on Mondays and Fridays. There are deviations from this cycle during the year. For example, Friday April 18th and Monday April 21st are Easter Holidays so have no sales, and the unexpectedly high peak on Dec 22nd is clearly due to the Christmas effect.
The effect of promotions on sales seemed to be dependant on the store type. Here ive reproduced what I think was very informative plot from the exploration section.
The most obvious bit of information shown in this graph is that stores of type ‘b’ seem to be the least affected by promotions. Here, we also see that store which does slightly worse during promotions.
Stores of type ‘a’ are the most affected by promotions, which may be good news because that is also the most numerous type. The black line shows us the overall effect, and we can see that it is to roughly double sales.
There is also a curious bimodality to the distributions (particularly for types b and c) which may be explained by an interaction with some other variable.
Finally I wanted to address the question of whether stores make more average sales on holidays, and show the interaction of storetype.
Here we can see some stark differences among store types. Store type b has higher average daily sales for all holidays (very slight difference on Christmas). Overall average sales are significantly higher for all holidays. This is probably due to the fact that stores of type ‘b’ (which have higher average sales in any case) represent the lion’s share of open stores on holidays.
Stores of type c are not open on Christmas or Easter, but have a slight increase in sales on other public holidays. On the other hand, store types a and d do significantly worse on all holidays.
One of the more interesting aspects about this dataset is that it was provided by the company as part of an ongoing regression competition. The task is to produce a model given the observed data that accurately predicts Sales for given stores on given dates.
This naturally led to a slight bias on the part of the exploratory analysis to try and identify any regularities in sales and obvious relationships to other variables.
I was surprised at some of the relationships (and lack therof) in the data. For example, I expected the distance to the nearest competitor for each store to be positively correlated to its average daily sales. That is, drug stores should sell more if they fewer nearby competitors. This turned out to be false. In fact there was a small negative correlation. In hindsight this also makes sense since it suggests that areas with more drug stores tend to have higher daily sales.
A not so surprising finding was the positive effect of promotions on total sales. When compiling the final plot for the cyclical trends and effects of promotions, I considered aggregating the entire three year period into a single plot. I decided to focus on just 2014, because simply aggregating by day of month would have obscured the weekly cycle.
It would have been very interesting to have been able to map the stores geographically and explore any trends there. This would have also shed some light on the effects of nearby competitors discussed above. However, the stores were intentionally anonymized, along with their characteristics (i.e. StoreType, Assortment).
Despite the anonymization of store types, some clear differences were identified. For example, stores of type ‘b’ remain mostly open on holidays and have higher throughput (sales and customers), suggesting that they are larger stores. They are also less affected by promotions. This information would, no doubt, be useful in the regression challenge.
My analysis focused on some guiding questions but did not make use of all of the variables in the dataset, such as the date the closest competitor became active, and the monthly promotion intervals. These may turn out to contain hidden patterns and should ideally be explored further.